ABSTRACT

This paper deals with the representation of video sequences useful
for tasks such as long-term analysis, indexing or browsing. A Table
of Contents (TOC) and index creation algorithm is presented, as well as
additional tools involved in their creation. The proposed method
does not assume any a priori knowledge about the content or the 
structure of the video sequence. It is therefore a generic technique. 
Some examples are presented in order to assess the performance of 
the algorithm. 

1. INTRODUCTION

In the framework of video analysis for indexing applications (for
example, applications related to the MPEG-7 standard), the
representation of video sequences is an important issue. It is not
enough to describe the content of a video sequence; techniques able
to create these descriptions automatically must also be developed. In
this paper, a new technique that automatically creates TOCs and
indexes is presented. The proposed algorithm relies only on visual
information, unlike other techniques which use both video and audio
information [5].

The goal of the TOC is to define the structure of the video
sequence in a hierarchical fashion, like in a text document. The
original sequence should be subdivided into sub-sequences, which can
in turn be divided into shorter sub-sequences, etc. At the end of this
division process, the shortest entity to be described is the
micro-segment. The index is also a hierarchical structure, but it
describes the occurrences of similar content rather than the
structure of the sequence.

The creation of the TOC and index follows the same strategy. 
The whole process is divided into three parts: shot detection, seg- 
mentation of shots and shot clustering. The first step splits the se- 
quence into shots, which are the input data for the next steps. The 
second step divides each shot into micro-segments, which are
sub-components of a shot where the camera activity is homogeneous.
These micro-segments constitute the lowest level of the TOC or
index representations. Finally, the third step of the algorithm creates
the hierarchical structures by clustering the detected video shots. 

This paper is structured as follows. Section 2 deals with the shot 
detection algorithm. The temporal segmentation of shots is detailed 
in section 3, whereas their clustering is explained in section 4.
Finally, some conclusions are drawn in section 5.

2. SHOT DETECTION 

The first step in the TOC and index creation splits the sequence
into shots. Taking into account that a shot is a set of contiguous
frames without editing effects, the algorithm has to detect the
transitions between consecutive shots. These transitions can be abrupt
or more sophisticated, like dissolves, fades, etc. The shot detection
algorithm consists of two main steps: computation of the mean
Displaced Frame Difference (mDFD) curve and its segmentation into
shots.

2.1. Computation of the mDFD curve 

The mDFD curve is obtained using a variation of the Displaced
Frame Difference (DFD) that takes into account both luminance and
chrominance information. If $\{f_k(i,j;t)\}_{k=Y,U,V}$ denotes the
luminance ($Y$) and the chrominance ($U$, $V$) components of the frame
at time $t$, the DFD is given by

$$DFD_k(i,j;t-1,t+1) = f_k(i,j;t+1) - f_k(i - d_x(i,j),\, j - d_y(i,j);\, t-1)$$

where $(d_x(i,j), d_y(i,j))$ is the displacement estimated at pixel
$(i,j)$, and the mDFD by:

$$mDFD(t) = \frac{1}{I_x I_y} \sum_{k \in \{Y,U,V\}} w_k \sum_{i,j} \left| DFD_k(i,j;t-1,t+1) \right| \tag{1}$$

where $I_x$ and $I_y$ are the image dimensions and $w_k$ are the weights
for the $Y$, $U$, $V$ components. In the example shown in Fig. 1 the
weights have been set to $\{w_Y, w_U, w_V\} = \{1, 3, 3\}$, which are
typical values. The highest peaks of the curve correspond to the abrupt
transitions (frames 21 100, 21 195, 21 633 and 21 724). On the other
hand, the oscillation from frame 21 260 to frame 21 279 corresponds to
a dissolve. The presence of large moving foreground objects in frames
21 100-21 195 and 21 633-21 724 creates high level oscillations of the
mDFD curve.

Figure 1: mDFD for the news2 sequence (frames 21 000 to 22 000,
thin line). The filtered curve (thick line) and the detected markers
(horizontal segments) are also plotted. S1-S10 depict the limits of
the detected shots.
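As an illustration of Eq. 1, the following Python sketch computes one
mDFD sample from two frames and a motion field. It is only a sketch
under stated assumptions: the function name `mdfd`, the (3, H, W) array
layout, integer-pixel motion vectors, full-resolution chrominance planes
and border clamping are all choices made here and are not specified in
the text.

```python
import numpy as np

def mdfd(prev, nxt, dx, dy, weights=(1.0, 3.0, 3.0)):
    """Weighted mean displaced frame difference between frames t-1 and t+1.

    prev, nxt: arrays of shape (3, H, W) holding the Y, U, V planes
               (assumption: chrominance has the same size as luminance).
    dx, dy:    integer displacement fields of shape (H, W), one per pixel.
    weights:   (w_Y, w_U, w_V); the default is the {1, 3, 3} of the text.
    """
    _, height, width = prev.shape
    rows, cols = np.mgrid[0:height, 0:width]
    # Displaced coordinates in frame t-1, clamped at the border (assumption).
    src_r = np.clip(rows - np.asarray(dy, dtype=int), 0, height - 1)
    src_c = np.clip(cols - np.asarray(dx, dtype=int), 0, width - 1)
    total = 0.0
    for k, w_k in enumerate(weights):
        dfd = nxt[k].astype(float) - prev[k].astype(float)[src_r, src_c]
        total += w_k * np.abs(dfd).sum()
    # Normalisation by the image dimensions, as in Eq. 1.
    return total / (height * width)
```

Calling `mdfd` for every pair of frames (t-1, t+1) yields the curve that
is segmented in the next step.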

2.2. Segmentation of the mDFD curve

The second block of the shot detection algorithm detects the video
editing effects. Our approach has been to adapt classical spatial
segmentation strategies to the case of temporal segmentation. Most
of the shot detection techniques use threshold-based segmentation
to extract the highest peaks of the mDFD or another type
of mono-dimensional curve [7]. Although a large number of shots
can actually be detected by this approach, this class of algorithms
is not robust and is quite sensitive to noise. In particular, it is
difficult to detect peaks of small contrast corresponding to fades or
special effects. The solution that we have used is a homogeneity-based
approach relying on morphological tools. The mDFD curve
is processed following four steps:

1. Simplification by temporal filtering. Morphological closing with
a 1D structuring element of length $l_{min}$. With this operation,
negative peaks whose length is less than $l_{min}$ frames are
removed. Physically, the parameter $l_{min}$ defines the duration
of the shortest shot to be detected.

2. Simplification by positive contrast filter. The effect of the filter
is to remove positive peaks that have a positive contrast lower
than the parameter $c$.

3. Marker extraction. Each marker corresponds to the kernel of
one shot. So, each marker must cover a portion of the curve
with a high probability of belonging to a single shot. Because
contiguous frames that belong to the same shot are quite similar,
the value of the mDFD will be small around those frames.
Thus, to extract the markers, a negative contrast filter is used
because it detects the relative minima of the curve. The
same parameter $c$ as in the previous step is used in the
negative contrast filter.

4. Watershed. This is the final step and its purpose is to propagate
the markers on the curve until all points are assigned to a
marker. The propagation process is performed by applying a
watershed algorithm [8] on the mDFD curve using as initial
markers those obtained in the previous step.

Figure 1 shows the filtered curve (both temporal filter and
positive and negative contrast filters), the resulting markers and
the detected shots using $l_{min} = 10$ and $c = 10$. Even though some
over-segmentation appears around frames 21 150 and 21 700, both the
scene cuts and the dissolve have been correctly detected. It is
important to mention that the over-segmentation is not a problem
because the next steps of the algorithm eliminate this effect.
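The four steps of section 2.2 can be sketched in Python as follows.
This is a minimal illustration and not the authors' implementation: the
contrast filters are approximated here by h-maxima and h-minima
transforms built on 1D geodesic reconstruction, the markers are taken as
the regional minima of the filtered curve, and the watershed is a simple
priority flood. All function names are hypothetical.

```python
import heapq
import numpy as np

def closing(f, n):
    """1D closing by a flat segment of n samples (dilation, then erosion)."""
    f = np.asarray(f, dtype=float)
    g = np.concatenate([np.full(n - 1, -np.inf), f])
    dil = np.array([g[i:i + n].max() for i in range(len(f))])
    g = np.concatenate([dil, np.full(n - 1, np.inf)])
    return np.array([g[i:i + n].min() for i in range(len(f))])

def reconstruct(marker, mask):
    """Geodesic reconstruction by dilation of marker under mask (1D)."""
    r = np.minimum(marker, mask).astype(float)
    for i in range(1, len(r)):              # left-to-right propagation
        r[i] = min(mask[i], max(r[i], r[i - 1]))
    for i in range(len(r) - 2, -1, -1):     # right-to-left propagation
        r[i] = min(mask[i], max(r[i], r[i + 1]))
    return r

def regional_minima(f):
    """Label each plateau that is strictly lower than its neighbours."""
    labels = np.zeros(len(f), dtype=int)
    n, i, lab = len(f), 0, 0
    while i < n:
        j = i
        while j + 1 < n and f[j + 1] == f[i]:
            j += 1
        if (i == 0 or f[i - 1] > f[i]) and (j == n - 1 or f[j + 1] > f[i]):
            lab += 1
            labels[i:j + 1] = lab
        i = j + 1
    return labels

def watershed_1d(f, markers):
    """Propagate marker labels, flooding low curve values first."""
    labels = markers.copy()
    heap = [(f[i], i) for i in range(len(f)) if labels[i] > 0]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        for j in (i - 1, i + 1):
            if 0 <= j < len(f) and labels[j] == 0:
                labels[j] = labels[i]
                heapq.heappush(heap, (f[j], j))
    return labels

def segment_curve(curve, l_min, c):
    """Return one shot label per frame of the mDFD curve."""
    curve = np.asarray(curve, dtype=float)
    f = closing(curve, l_min)              # step 1: temporal filtering
    f = reconstruct(f - c, f)              # step 2: positive contrast (h-maxima)
    f = -reconstruct(-f - c, -f)           # step 3: negative contrast (h-minima)
    markers = regional_minima(f)           # step 3: markers = remaining minima
    return watershed_1d(curve, markers)    # step 4: propagation
```

Frames sharing a label belong to the same detected shot; a change of
label marks a shot boundary. Flooding the original curve rather than the
filtered one is an assumption of this sketch.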

3. TEMPORAL SEGMENTATION OF VIDEO SHOTS USING CAMERA MOTION PARAMETERS

The goal of the temporal segmentation algorithm is to split each shot
into several micro-segments which present a high level of homogeneity
in the camera motion parameters. The data used to perform the
segmentation are the camera motion parameters currently used in
MPEG-7 [2]. This algorithm is applied to each shot separately and
consists of two steps: 1) each shot is over-segmented into several
micro-segments which must present a perfect homogeneity (see
section 3.1), and 2) a merging process is applied while the homogeneity
level of the set of micro-segments is below a predefined threshold.

In order to segment the shot, it is necessary to define a distance to
compare sub-segments and a parameter which allows the assessment
of the quality of a micro-segment or a partition (i.e., a set of
micro-segments). In both cases, we use a motion histogram, defined as
follows:

$$H_s[i] = \frac{N_i}{L_s}, \quad i \in \{\text{PanLeft}, \text{PanRight}, \text{ZoomIn}, \text{ZoomOut}, \text{Fix}, \ldots\} \tag{2}$$

where $s$ represents the label of the segment inside the shot, $i$ the
motion type, $L_s$ the length of segment $s$ and $N_i$ the number of
frames of segment $s$ with motion type $i$. Note that each bin of the
histogram $H_s$ gives the fraction of frames with a specific type of
motion. $\sum_i H_s[i]$ may be higher than 1.0, because different motions
can appear concurrently.
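A small Python sketch of the motion histogram of Eq. 2 follows. The
input representation (one set of motion labels per frame) and the name
`motion_histogram` are assumptions made for the example; the segment is
assumed to contain at least one frame.

```python
from collections import Counter

def motion_histogram(frames):
    """Fraction of frames of a segment showing each camera motion type.

    frames: non-empty list with one set of motion labels per frame;
            a frame may carry several labels at once.
    """
    counts = Counter(label for frame in frames for label in frame)
    length = len(frames)
    return {label: n / length for label, n in counts.items()}

segment = [{"PanLeft"}, {"PanLeft", "ZoomIn"}, {"Fix"}, {"PanLeft", "ZoomIn"}]
hist = motion_histogram(segment)
# Concurrent motions make the bins sum to more than 1.0 for this segment.
```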

3.1. Homogeneity 

We assume that a segment is perfectly homogeneous when it presents 
a single combination of camera motion parameters along all its 
frames. A segment is not homogeneous when it presents impor- 
tant variations on these parameters. The segment homogeneity is 
computed on its histogram (Eq. 2). If a segment is perfectly
homogeneous, the histogram bins are equal to either 1.0 or 0.0. If a
segment is not perfectly homogeneous, the bins can present
intermediate values.

To measure the segment homogeneity, we measure how much its
histogram differs from the ideal one. The distance corresponding to
bins with high values is the difference between the bin value and 1.0.
Analogously, for bins with small values, the distance is the bin value
itself. Mathematically, the homogeneity of a segment $s$ is given by:

$$\mathcal{H}(s) = \sum_i e(i), \qquad e(i) = \begin{cases} 1.0 - H_s[i] & \text{if } H_s[i] \geq 0.5 \\ H_s[i] & \text{if } H_s[i] < 0.5 \end{cases} \tag{3}$$

where $H_s$ is the histogram of the segment $s$ and $i$ indicates the
different motion types. The homogeneity of a shot $S$ is equal to
the homogeneity of its segments weighted by the length of each of
them, as shown in the following equation:

$$\mathcal{H}(S) = \frac{1}{L_S} \sum_{i=1}^{N} L_{s_i} \mathcal{H}(s_i) \tag{4}$$

where $L_S = \sum_{i=1}^{N} L_{s_i}$ is the total length of shot $S$ and
$N$ the number of segments it contains. Note that small values of
$\mathcal{H}$ correspond to high levels of homogeneity.

The distance between two segments, $s_1$ and $s_2$, is the homogeneity
of the union of the segments:

$$d(s_1, s_2) = \mathcal{H}(s_1 \cup s_2) \tag{5}$$
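The homogeneity measures and the segment distance can be sketched in
Python as below. The sketch assumes that the per-bin distances are
simply summed over the motion types (any normalisation constant would
not change the ordering of the values), and it represents a segment as a
hypothetical `(length, histogram)` pair.

```python
def segment_homogeneity(hist):
    """Distance of a histogram to the ideal one (0.0 = perfectly homogeneous)."""
    return sum(1.0 - v if v >= 0.5 else v for v in hist.values())

def shot_homogeneity(segments):
    """Length-weighted homogeneity of a list of (length, histogram) pairs."""
    total = sum(length for length, _ in segments)
    return sum(length * segment_homogeneity(h) for length, h in segments) / total

def union(seg_a, seg_b):
    """Histogram of the union of two segments, weighted by their lengths."""
    (len_a, hist_a), (len_b, hist_b) = seg_a, seg_b
    total = len_a + len_b
    labels = set(hist_a) | set(hist_b)
    hist = {k: (len_a * hist_a.get(k, 0.0) + len_b * hist_b.get(k, 0.0)) / total
            for k in labels}
    return total, hist

def distance(seg_a, seg_b):
    """Distance between two segments: homogeneity of their union."""
    return segment_homogeneity(union(seg_a, seg_b)[1])

pan = (10, {"PanLeft": 1.0})
fix = (10, {"Fix": 1.0})
```

Two perfectly homogeneous segments with different motions, such as `pan`
and `fix` above, each have homogeneity 0.0 but a large mutual distance,
which is what keeps them apart during merging.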

3.2. Algorithm 

The temporal segmentation algorithm consists of the following steps: 

1. Initial over-segmentation. In this step, the shot is
over-segmented in order to obtain a set of perfectly homogeneous
micro-segments. Mathematically, the following condition must
be fulfilled:

$$\mathcal{H}(s) = 0, \quad \forall s \in S$$

2. Fusion order. In this step, the distance between all neighboring
segments (temporally connected) is computed using Eq. 5. The
closest pair of segments is then selected for possible merging in
the next step.

3. Fusion criterion. To decide if the selected pair of segments
is going to be merged, we compute the homogeneity of the
shot (Eq. 4) assuming that the minimum distance segments
have already been merged, and then the following criterion is
applied:

Merge, if $\mathcal{H}(S) \leq \theta$
Do not merge, if $\mathcal{H}(S) > \theta$

where $\theta$ is the predefined homogeneity threshold.
Note that the fusion criterion is global: the decision depends
on the homogeneity of the resulting partition and not
exclusively on the homogeneity of the resulting segment. If
the merging is done, a new iteration starts at the second step
of the algorithm. The merging process ends when there is no
pair of neighboring segments that can be merged.

The algorithm allows different levels of detail as shown in the 
example of Fig. 2. 
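The three steps of the temporal segmentation can be sketched as
follows. This is an illustrative reading of the algorithm, not the
authors' code: frames are given as sets of motion labels, the per-bin
distances of the homogeneity are summed, and the loop stops as soon as
the closest pair fails the global criterion. Function names and the
`threshold` parameter are hypothetical.

```python
from collections import Counter
from itertools import groupby

def histogram(frames):
    counts = Counter(label for frame in frames for label in frame)
    return {label: n / len(frames) for label, n in counts.items()}

def homogeneity(frames):
    return sum(1.0 - v if v >= 0.5 else v for v in histogram(frames).values())

def shot_homogeneity(segments):
    total = sum(len(s) for s in segments)
    return sum(len(s) * homogeneity(s) for s in segments) / total

def segment_shot(frames, threshold):
    """Split a shot (list of per-frame motion sets) into micro-segments."""
    # Step 1: runs of identical motion sets are perfectly homogeneous.
    segments = [list(run) for _, run in groupby(frames)]
    while len(segments) > 1:
        # Step 2: closest pair of temporally adjacent segments.
        k = min(range(len(segments) - 1),
                key=lambda i: homogeneity(segments[i] + segments[i + 1]))
        candidate = (segments[:k] + [segments[k] + segments[k + 1]]
                     + segments[k + 2:])
        # Step 3: global criterion on the partition that would result.
        if shot_homogeneity(candidate) > threshold:
            break
        segments = candidate
    return segments

PAN, FIX, ZOOM = frozenset({"PanLeft"}), frozenset({"Fix"}), frozenset({"ZoomIn"})
shot = [PAN] * 10 + [FIX] + [PAN] * 10 + [ZOOM] * 20
```

With a threshold of 0.0 the initial over-segmentation is returned
unchanged, whereas a larger threshold absorbs the single `FIX` frame
into the surrounding pan. This mirrors the different levels of detail
mentioned in the text.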

4. SHOT CLUSTERING 

The shot clustering process is divided into two parts: shot merging 
and tree structuring. In the first step, pairs of shots are grouped 
together creating a binary tree. In the second step, the binary tree 
is restructured in order to reflect the similarity present in the video 
sequence. 

4.1. Shot merging 

The shot merging algorithm yields a binary tree which represents
the merging order of the initial shots (see section 2). In this tree, the
leaves represent the initial shots, the top node represents the whole
sequence and the intermediate nodes represent sequences that are
created by the merging of several shots. The merging criterion is
defined by a distance between shots, merging first the closest shots.
In order to compute the distance between shots it is necessary to
define a shot model that provides the features to be compared, and
to set the neighborhood links between them, which indicate what
merging can be done. The TOC and index creation only differ in
the neighborhood links. The former sets a link between each pair of
temporally connected shots, the latter between all pairs of shots.

The process ends when all the initial shots have been merged into
a single node or when the minimum distance between all couples
of linked nodes is greater than the specified threshold [10]. In the
latter case, not a single tree but a set of binary trees (a forest of
binary trees) is obtained.

Figure 2: Example of temporal segmentation of shot #8 of the MPEG-7
nhkvideo sequence. Above: camera motion parameters chronogram;
below: two segmentations using different levels of homogeneity.

Figure 4: Binary tree created by the shot merging algorithm. The
initial shots are shown in Fig. 3. The top node represents the whole
sequence, while the leaf nodes represent the initial shots.

4.1.1. Shot and sequence model

The shot model must allow us to compare the content of several shots
to decide which shots must be merged and in what order.
In still images, luminance and chrominance are the main
features of the image [9, 3]. In a video sequence, due to the temporal
evolution, motion is an important source of information [3, 6, 4]. So,
average images, histograms of luminance/chrominance information
(YUV components) and motion information (x and y components
of motion vectors) are used to model the shots.

4.1.2. Merging distance 

In order to compute the distance between a pair of nodes, three 
different cases must be considered depending on the type of nodes. 
At the beginning of the merging process, the nodes represent a single 
shot: the average of the Euclidean distance between the different 
components of the model is used. In the following steps of the 
process, the distance must be computed between nodes that may 
represent more than a single shot: the minimum and maximum
distances between all pairs of shots (one from each node) are used [1].

4.1.3. Merging algorithm

The merging algorithm consists of three preliminary steps and the 
merging process itself. In the first step, nodes are modeled according 
to the model previously described. In the second step, neighborhood 
links between shots are set and in the third step the distance between 
linked shots is computed. 

Since the whole algorithm must be able to create both the TOC
and the index of a video sequence, the criterion used to set the
neighborhood links differs depending on the main objective. In the TOC
creation process, neighborhood links are set according to temporal
connectivity. So, a pair of shots will be linked, and therefore may
be merged, only if they are consecutive in the time domain. On the
other hand, all pairs of shots will be linked, and therefore may be
merged, in the index creation process.

Once the algorithm has been initialized, the merging process
starts performing the following steps: 

• Get minimum distance link. Both the minimum and the maximum
distance are computed for every pair of linked nodes. Before
checking the minimum distance, the maximum distance is
checked. If the maximum distance is higher than the maximum
distance threshold $d_{max}$, then the link is discarded. Otherwise,
the link is taken into account. Once all links have been scanned,
the minimum distance link is obtained.

• Check distance criterion. In order to decide if the nodes pointed
by the minimum distance link must be merged, the minimum
distance is compared to the minimum distance threshold $d_{min}$.
If the minimum distance is higher than the threshold, no merging
is performed and the process ends. Otherwise, the pointed
nodes are merged and the process goes on.

• Merge nodes. Nodes pointed by the minimum distance link
are merged.

• Update links. This step updates the links to take into account
the merging that has been done.

• Update distances. Once links have been updated, the distance
of those links which point to the new node is recomputed.

• Check top node. This step checks the number of remaining
nodes. If all initial shots have been merged into a single node,
the process ends. Otherwise, a new iteration begins.

The merging process may yield a single tree if all the initial shots 
are similar enough or a forest if initial shots are quite different. 
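The merging loop can be sketched in Python as follows. Several
simplifications are assumed here and are not taken from the text: each
shot is reduced to a single feature vector compared with the Euclidean
distance, links and distances are recomputed at every iteration instead
of being updated incrementally, and trees are returned as nested tuples
of shot indices. The names `merge_shots`, `d_min_thr` and `d_max_thr`
are hypothetical.

```python
import itertools
import numpy as np

def node_distances(shots_a, shots_b, feats):
    """Minimum and maximum distance over all pairs of shots, one per node."""
    d = [np.linalg.norm(feats[i] - feats[j]) for i in shots_a for j in shots_b]
    return min(d), max(d)

def merge_shots(feats, d_min_thr, d_max_thr, temporal=True):
    """Return a forest of binary trees (nested tuples of shot indices)."""
    # Each node is (tuple of the shots it contains, tree built so far).
    nodes = [((i,), i) for i in range(len(feats))]
    while len(nodes) > 1:
        if temporal:   # TOC: only temporally adjacent nodes are linked
            links = [(k, k + 1) for k in range(len(nodes) - 1)]
        else:          # index: every pair of nodes is linked
            links = list(itertools.combinations(range(len(nodes)), 2))
        best = None
        for a, b in links:
            lo, hi = node_distances(nodes[a][0], nodes[b][0], feats)
            if hi > d_max_thr:
                continue                       # link discarded
            if best is None or lo < best[0]:
                best = (lo, a, b)              # minimum distance link so far
        if best is None or best[0] > d_min_thr:
            break                              # distance criterion not met
        _, a, b = best
        merged = (nodes[a][0] + nodes[b][0], (nodes[a][1], nodes[b][1]))
        nodes = nodes[:a] + [merged] + nodes[a + 1:b] + nodes[b + 1:]
    return [tree for _, tree in nodes]

toc_feats = [np.array([0.0]), np.array([0.1]), np.array([5.0]), np.array([5.2])]
idx_feats = [np.array([0.0]), np.array([5.0]), np.array([0.1]), np.array([5.2])]
```

With `temporal=True` only consecutive shots can be grouped, which gives
the TOC; with `temporal=False` similar shots are grouped wherever they
occur, which gives the index. A low `d_min_thr` leaves a forest, a high
one a single tree.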

An example of a binary tree for TOC creation is shown in Fig. 4.
Inside each leaf node of this tree, we have indicated its label and,
between brackets [...], the starting and ending frame numbers of the
shot. Inside each remaining node, we have indicated its label,
between parentheses (...) the fusion order, and between brackets the
minimum and maximum distance between its two child nodes.

4.2. Tree structuring

The purpose of this algorithm is to restructure the binary tree
into an arbitrary tree that should reflect more clearly the video
structure. To this end, nodes that have been created by the binary
merging process but that do not convey any relevant information 
should be removed. The criterion used to decide if a node must 
appear in the final tree is based on the variation of the similarity 
degree (distance) between the shots included in the node: 

• If the analyzed node is the root node (or one of the root nodes, 
if various binary trees have been obtained after the merging 
algorithm), then the node should be preserved in the final tree. 

• If the analyzed node is a leaf node (i.e., it corresponds to an
initial shot), then it also has to remain in the final tree.

• Otherwise, the node will be kept in the final tree if the following
conditions are satisfied:

$$|d_{min}(\text{analyzed node}) - d_{min}(\text{parent node})| < \Theta$$
$$|d_{max}(\text{analyzed node}) - d_{max}(\text{parent node})| < \Theta$$

Figure 5: Tree yielded by the tree structuring algorithm.

As shown in Fig. 5, the tree resulting from the restructuring step
represents more clearly the structure of the video sequence. Nodes
in the first level of the hierarchy (28, 12, 13, 21) represent the four
scenes of the sequence, while nodes in the third (or occasionally
fourth) level represent the initial shots.
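The node selection rules of the tree structuring step can be sketched
in Python as below. The sketch applies the criterion exactly as stated
(root and leaf nodes are kept; an intermediate node is kept when both
its minimum and maximum distances differ from those of its parent by
less than a threshold) and attaches the children of a removed node to
the nearest kept ancestor. The `Node` class, the `theta` parameter and
the choice of the binary-tree parent as "parent node" are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    d_min: float = 0.0      # minimum distance between the two merged children
    d_max: float = 0.0      # maximum distance between the two merged children
    children: list = field(default_factory=list)

def keep(node, parent, theta):
    """Decide whether a node of the binary tree appears in the final tree."""
    if parent is None or not node.children:    # root node or leaf node
        return True
    return (abs(node.d_min - parent.d_min) < theta
            and abs(node.d_max - parent.d_max) < theta)

def restructure(node, theta, parent=None):
    """Return the list of nodes replacing `node` in the final tree."""
    new_children = []
    for child in node.children:
        new_children.extend(restructure(child, theta, node))
    if keep(node, parent, theta):
        return [Node(node.label, node.d_min, node.d_max, new_children)]
    return new_children     # node removed: its children move up one level

root = Node("R", 5.0, 6.0, [
    Node("A", 4.8, 5.9, [Node("a1"), Node("a2")]),
    Node("B", 0.2, 0.5, [Node("b1"), Node("b2")]),
])
final = restructure(root, theta=1.0)
```

In this toy tree node `A` is kept, while node `B` is removed and its two
shots become direct children of the root, so the result is no longer
binary.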

5. CONCLUSIONS 

In this paper, techniques for TOC and index creation, video shot
detection and micro-segment segmentation have been presented.
The shot detection algorithm is able to cope with
abrupt transitions as well as smooth ones, even though it
over-segments those portions of the video sequence with a high level
of motion. In order to solve this drawback, a confidence mask will
be added to the motion vectors to discard those parts of the image
which cannot be properly estimated when computing the mDFD.

The ability of the micro-segment segmentation algorithm to compute
partitions at different levels of detail makes it possible to adjust the
granularity of the resulting TOC or index to suit different
applications. Moreover, the algorithm is quite robust to
errors in the camera motion parameters.

Finally, the TOC and index creation algorithm provides a
unified method to compute both structures. Moreover, new pieces of
information can easily be introduced in the merging process. For
[...] developed to deal with [...]. In the case of knowing beforehand
some characteristics of the sequence (e.g. the type of video), this
information can be taken into account to improve the performance of
the algorithm in specific scenarios.
CLAIMS:

1. A preprocessing method provided for defining, before tasks such as long- 
term analysis, indexing or browsing, the structure of a video sequence and describing it, 
said method comprising a first step, provided for creating a table of content, that gives
said structure in a hierarchical fashion, and indexes, that describe the occurrences of
similar content.

2. A method according to claim 1, in which the creation of the table of content
and indexes is carried out on the basis of a strategy comprising the following sub-steps:
a first splitting sub-step for sub-dividing the sequence into shots, a second splitting
sub-step for sub-dividing each shot into shorter entities called micro-segments homogeneous
according to a predetermined criterion, and a third clustering sub-step for creating said
hierarchical structure by clustering the detected video shots.

3. A method according to claim 2, in which said first splitting sub-step
comprises the following operations: a computation of the mean displaced frame
difference curve, obtained using a variation of the displaced frame difference to take into
account both luminance and chrominance information, and a segmentation of said curve,
based on the detection of video editing effects.

4. A method according to claim 3, in which said curve is processed following
four steps: a simplification by temporal filtering, a simplification by positive contrast
filtering, a marker extraction, and a propagation of the markers until all points are
assigned to a marker.

5. A method according to any one of claims 2 to 4, in which each shot is split
into several micro-segments presenting a high level of homogeneity on the motion
parameters of the camera by which said video sequence has been delivered.

6. A method according to any one of claims 2 to 5, in which said third clustering
sub-step is divided into a first shot merging operation, provided for creating a binary tree
by grouping together pairs of shots, and a second tree structuring operation, provided for
restructuring said tree in order to reflect the similarity present in the video sequence.